Neural network target feature detection

ABSTRACT

A method of training a neural network for detecting target features in images is described. The neural network is trained using a first data set that includes labeled images, where at least some of the labeled images have subjects with labeled features, including: dividing each of the labeled images of the first data set into a respective plurality of tiles, and generating, for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile. Target features that correspond to the plurality of feature anchors are detected in a second data set of unlabeled images. Images of the second data set having target features that were not detected are labeled. A third data set that includes the first data set and the labeled images of the second data set is generated. The neural network is trained using the third data set.

BACKGROUND

Face detection is one of the earliest computer vision algorithms used in industry. Currently, it is used in many applications including, for example, camera image signal processing, surveillance cameras, computer access authentication, robotics, and artificial intelligence based cameras. Many recent face detection algorithms rely upon machine learning approaches, such as neural networks, because of their accuracy. However, the running time and power consumption of a neural network for face detection may keep this approach from being implemented on many in-device applications.

SUMMARY

In accordance with some examples of the present disclosure, a method of training a neural network for detecting target features in images is described. The neural network is trained using a first data set that includes labeled images, where at least some of the labeled images have subjects with labeled features, including: dividing each of the labeled images of the first data set into a respective plurality of tiles, and generating, for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile. Target features that correspond to the plurality of feature anchors are detected in a second data set of unlabeled images. Images of the second data set having target features that were not detected are labeled. A third data set that includes the first data set and the labeled images of the second data set is generated. The neural network is trained using the third data set.

In accordance with some examples of the present disclosure, a system for training a neural network for detecting target features in images is described. The system comprises a processor, and a memory storing computer-executable instructions that when executed by the processor cause the system to: train the neural network using a first data set that includes labeled images, at least some of the labeled images having subjects with labeled features, including dividing each of the labeled images into a respective plurality of tiles, and generating, for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile; detect target features that correspond to the plurality of feature anchors in a second data set of unlabeled images; label images of the second data set having target features that were not detected; generate a third data set that includes the first data set of labeled images and the labeled images of the second data set; and train the neural network using the third data set.

In accordance with some examples of the present disclosure, an image processing system that includes a neural network implemented on a computer for feature detection is described. The image processing system includes a convolutional neural network having a plurality of layers stacked sequentially, including: a first set of layers, each layer of the first set of layers having a depth-wise convolution and a point-wise convolution, wherein the first set of layers is a first subset of a different neural network; and a second set of layers after the first set of layers, each layer of the second set of layers having a point-wise convolution.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting and non-exhaustive examples are described with reference to the following Figures.

FIG. 1 depicts an example of an image processing system that is configured to detect target features of subjects, according to an embodiment.

FIG. 2 depicts an example of a computing device that includes a feature detection engine configured to detect target features of subjects, according to an embodiment.

FIG. 3 depicts an example of a pre-processor for a feature detection engine, according to an embodiment.

FIG. 4 depicts an example of a neural network model for a feature detection engine, according to an embodiment.

FIG. 5 depicts an example of a neural network for detecting target features of an image, according to an embodiment.

FIG. 6 depicts an example of a neural network for a feature detection engine, according to an embodiment.

FIG. 7 depicts an example of a post-processor for a feature detection engine, according to an embodiment.

FIG. 8 depicts details of a method of training a neural network for detecting target features in images, according to an embodiment.

FIG. 9 is a block diagram illustrating physical components (e.g., hardware) of a computing device with which aspects of the disclosure may be practiced.

FIGS. 10 and 11 illustrate a mobile computing device, according to an embodiment.

DETAILED DESCRIPTION

In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown by way of illustrations specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the present disclosure. Embodiments may be practiced as methods, systems, or devices. Accordingly, embodiments may take the form of a hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims and their equivalents.

Aspects of the present disclosure are directed to detecting target features in received images. For example, a computing device receives images from an image sensor and detects a body, face, eyes, hands, or other features of subjects within the images. In accordance with examples of the present disclosure, a computing device utilizes a feature detection engine that detects the target features using a neural network model. The feature detection engine may include a pre-processor that resizes an original image and changes one or more color parameters (e.g., color scale or color representation) to generate an input image for the neural network model. The feature detection engine may also include a post-processor that separates information for different detected target features and scales respective bounding boxes for the target features to the original image.

In accordance with embodiments of the present disclosure, FIG. 1 depicts an example of an image processing system 100 that is configured to detect target features in images. Example target features include bodies, faces, eyes, hands, or other features of people or animals, in various embodiments. In some embodiments, the image processing system 100 is configured to detect both faces and bodies of people. In other embodiments, different combinations of target features may be detected by the image processing system according to labels that are associated with images for training, as described below. The image processing system 100 includes a computing device 110 and a server 120. In some embodiments, the image processing system 100 also includes a data store 160. A network 150 communicatively couples computing device 110, server 120, and data store 160. The network 150 may comprise one or more networks such as local area networks (LANs), wide area networks (WANs), enterprise networks, the Internet, etc., and may include one or more of wired, wireless, and/or optical portions.

Computing device 110 may be any type of computing device, including a mobile computer or mobile computing device (e.g., a Microsoft® Surface® device, a laptop computer, a notebook computer, a tablet computer such as an Apple iPad™, a netbook, etc.), or a stationary computing device such as a desktop computer or PC (personal computer). Computing device 110 may be configured to execute one or more software applications (or “applications”) and/or services and/or manage hardware resources (e.g., processors, memory, etc.), which may be utilized by users (e.g., customers) of the server 120. Server 120 may include one or more server devices, distributed computing platforms, and/or other computing devices.

The computing device 110 may include a feature detection engine 112 that receives images and processes those images to detect and identify target features. In some scenarios, the feature detection engine 112 provides a bounding box that surrounds a detected target feature. In an embodiment, the feature detection engine 112 is configured to utilize a neural network model, such as a neural network model 162, described below. The server 120 includes a feature detection engine 122, which may be the same as, or similar to, the feature detection engine 112.

In accordance with examples of the present disclosure, the feature detection engine 112 may receive one or more images and provide them to a neural network model executing at a neural processing unit. The neural network model may output detection information for one or more detected target features, as described below. Because the neural processing unit is specifically designed and/or programmed to process neural network tasks, the consumption of resources, such as power and/or computing cycles, is less than the consumption would be if a central processing unit were used.

The data store 160 is configured to store data, for example, the neural network model 162 and source images 164. In various embodiments, the data store 160 is a network server, cloud server, network attached storage (“NAS”) device, or other suitable computing device. Data store 160 may include one or more of any type of storage mechanism, including a magnetic disc (e.g., in a hard disk drive), an optical disc (e.g., in an optical disk drive), a magnetic tape (e.g., in a tape drive), a memory device such as a random access memory (RAM) device, a read-only memory (ROM) device, etc., and/or any other suitable type of storage medium. Although only one instance of the data store 160 is shown in FIG. 1, the image processing system 100 may include two, three, or more similar instances of the data store 160. Moreover, the network 150 may provide access to other data stores, similar to data store 160, that are located outside of the image processing system 100, in some embodiments.

The neural network model 162 is configured to detect target features in received images. In some scenarios, the neural network model 162 is trained to detect target features using the source images 164. For example, the source images 164 include various images, at least some of which include bodies, faces, eyes, hands, or other features of subjects (e.g., people or animals) within the image, and the neural network model 162 is trained to determine a bounding box for the detected target feature. In some embodiments, the neural network model 162 is also configured to determine a confidence level of the detection (e.g., 95% confident). The data store 160 includes the neural network model 162 and source images 164 for training the neural network model 162, in some embodiments. In other embodiments, the source images 164 are omitted from the data store 160, but are stored in another suitable storage.

FIG. 2 depicts an example of a computing device 200 that includes a feature detection engine 202 configured to detect target features of subjects in an original image 210, according to an embodiment. The computing device 200 may correspond to the computing device 110 and/or server 120, and the feature detection engine 202 may correspond to the feature detection engine 112 and/or 122, in various embodiments. The computing device 200 may also include a central processing unit (CPU) 204 and a neural processing unit (NPU) 208. The feature detection engine 202 includes a pre-processor 220, a neural network model 230, and a post-processor 240. The pre-processor 220 is configured to resize an original image and change one or more color parameters (e.g., color scale or color representation) to generate an input image for the neural network model 230. The neural network model 230 generally corresponds to the neural network model 162, in an embodiment. The post-processor 240 is configured to receive and process an output from the neural network model 230 and provide estimated feature locations for the target features, for example, as a bounding box 250 around a detected face.

In some embodiments, the feature detection engine 202 may execute processing at the CPU 204, without utilizing the NPU 208. In one such embodiment, a structure of the neural network model 230 is readily executed by the CPU 204 and the NPU 208 is omitted from the computing device 200. In other embodiments, the feature detection engine 202 may execute processing at the CPU 204 and/or the NPU 208. For example, processing of the neural network model 230 may occur at the NPU 208. The NPU 208, being configured to efficiently execute processing associated with neural network models, may allow the feature detection engine 202 to operate in or near real-time such that a face or body of a subject within an image may be detected in or near real-time without consuming resources traditionally expended by the CPU 204.

FIG. 3 depicts an example of a pre-processor 300 for a feature detection engine, according to an embodiment. The pre-processor 300 may correspond to the pre-processor 220 of the feature detection engine 202, in some embodiments. The pre-processor 300 includes a resize processor 320, a color scale converter 330, and a color representation normalizer 340. The resize processor 320 receives an original image 310 (e.g., a source image 164, original image 210) and resizes the original image 310 according to a structure of the neural network model 230. In the embodiments described herein, the neural network model 162 and neural network model 230 are configured to process images of size 352×352 pixels with three values that represent color (red, green, blue) at each pixel. Accordingly, the resize processor 320 is configured to resize the original image to 352×352 pixels. The resize processor 320 is configured to perform bilinear interpolation, bicubic interpolation, sinc resampling, box sampling, or other suitable resize procedures to resize the original image, in various embodiments.

After resizing the original image 310, the color scale converter 330 converts a color scale of the resized image to an RGB planar format, in some embodiments. In this format, color data for an individual pixel is spread across different bitplanes. In some embodiments, the color scale converter 330 is omitted. The color representation normalizer 340 normalizes the colors for each pixel of the resized image to have values from −1 to +1 as a floating point value, instead of an integer value of 0 to 255. In other words, the color representation normalizer 340 maps a red value of 255 to a value of 1.0 and maps a red value of 0 to −1.0. The resized, color converted, and normalized image is referred to as the input image 350, which is provided to the neural network model (e.g., neural network model 230).
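
By way of a non-limiting illustration, the planar conversion and normalization described above may be sketched in Python as follows. This is a minimal sketch assuming the resize has already been performed (e.g., by bilinear interpolation); the function name and the use of NumPy are illustrative and not part of the disclosure.

import numpy as np

def preprocess(resized_rgb: np.ndarray) -> np.ndarray:
    # resized_rgb: 352 x 352 x 3 uint8 image, already resized by the
    # resize processor (e.g., via bilinear interpolation).
    # Interleaved HWC -> planar CHW, so each color channel occupies its
    # own plane (the RGB planar format described above).
    planar = np.transpose(resized_rgb, (2, 0, 1)).astype(np.float32)
    # Map integer values 0..255 onto floating-point values -1.0..+1.0,
    # so 0 maps to -1.0 and 255 maps to +1.0.
    return planar / 127.5 - 1.0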

FIG. 4 depicts an example of a neural network model 400 for a feature detection engine, according to an embodiment. The neural network model 400 generally corresponds to the neural network model 162 and/or 230, in some embodiments. The neural network model 400 is configured to process an input image 410 and detect target features in the input image. To do so, the neural network model 400 divides the input image into a plurality of tiles and generates a plurality of feature anchors. Specifically, the neural network model 400 is configured to divide the input image 410 into a matrix of 11×11 tiles (121 total tiles), where each tile has a height of 32 pixels and a width of 32 pixels. In other embodiments, a different number of tiles is used, for example, 8×8 (64 tiles), 15×15 (225 tiles), or another suitable number of tiles.

Each of the tiles corresponds to a respective plurality of feature anchors, where a feature anchor is a data structure that represents a possible location within the tile where an instance of a corresponding feature may be located. In the embodiments described herein, a feature anchor includes five elements: an x coordinate, a y coordinate, a width, a height, and a confidence level. In some examples, the x, y coordinates indicate a center of a bounding box that contains the target feature and has the corresponding width and height. In other embodiments, the x, y coordinates of the feature anchor correspond to a different reference point for the bounding box, for example, an upper left corner, lower right corner, etc. As described above, a target feature may be a face, body, or other element of a subject within the input image. Accordingly, the feature anchors may correspond to face anchors for detecting human faces, body anchors for detecting human bodies, or combinations of both face and body anchors. In the embodiments described herein, the neural network model 400 utilizes seven total anchors, with five anchors for body detection and two anchors for face detection. In other words, the body anchors indicate a possible location of a body within a tile and the face anchors indicate a possible location of a face within the tile.

The neural network model 400 outputs the feature anchors as tile data 450 having dimensions of A×B×C, where A×B corresponds to the number of tiles and C corresponds to five times the maximum number of target features that may be detected within each tile. In other words, 11×11 tiles with seven feature anchors each correspond to tile data 450 having dimensions of 11×11×35.
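
As a non-limiting illustration, the feature anchor structure and the resulting tile data dimensions may be expressed as follows (a minimal Python sketch; the names are illustrative and not part of the disclosure):

from dataclasses import dataclass

@dataclass
class FeatureAnchor:
    x: float           # x coordinate of the bounding-box reference point
    y: float           # y coordinate of the bounding-box reference point
    width: float       # bounding-box width
    height: float      # bounding-box height
    confidence: float  # confidence level of the detection

TILES = 11              # 11 x 11 grid of 32 x 32-pixel tiles (352 / 32 = 11)
NUM_ANCHORS = 7         # five body anchors plus two face anchors
VALUES_PER_ANCHOR = 5   # x, y, width, height, confidence

# Tile data 450 has dimensions A x B x C, here 11 x 11 x 35.
tile_data_shape = (TILES, TILES, NUM_ANCHORS * VALUES_PER_ANCHOR)
assert tile_data_shape == (11, 11, 35)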

Although existing neural networks are able to detect faces and other features, their speed and complexity may not allow for their use on computing devices without dedicated neural network processing capabilities. The neural network model 400 overcomes this limitation by including a neural network having a plurality of layers stacked sequentially, formed from a first subset network 420 and a “compressed” network 430. The first subset network 420 includes a first set of layers that represents a first subset of a different neural network, such as the MobileNet neural network, and the compressed network 430 includes a second set of layers that is based upon a compression of a remainder subset of the different neural network.

The MobileNet neural network generally includes 13 convolution layers, where each layer includes a depth-wise convolution and a point-wise convolution. In the neural network model 400, the first subset network 420 includes only the first four layers of the MobileNet neural network, with an output of these layers being provided to the compressed network 430, which includes a modified, compressed representation of the remaining nine layers of the MobileNet neural network. In some scenarios, the MobileNet neural network may be reduced in computational cost and number of parameters using a width multiplier that reduces complexity at each layer. However, this approach also reduces accuracy. Rather than simply reducing a “width” of each layer, the neural network model 400 uses the first four convolution layers without modification, but the remaining nine convolution layers are compressed to a point-wise convolution (i.e., instead of a depth-wise convolution and a point-wise convolution).
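
By way of illustration, this layer arrangement may be sketched in PyTorch as follows. This is a minimal sketch only: it assumes MobileNet's published channel widths for the first four layers, shows only a few of the nine compressed layers, and uses illustrative strides in the compressed portion so the output grid is 11×11; none of these choices are specified by the disclosure.

import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    # One MobileNet-style layer: a depth-wise 3x3 convolution followed
    # by a point-wise 1x1 convolution.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

def pointwise(c_in, c_out, stride=1):
    # One "compressed" layer: the depth-wise convolution is dropped and
    # only the point-wise 1x1 convolution remains.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 1, stride, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class CompressedDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # First subset network (420): stem plus the first four
        # depth-wise-separable MobileNet layers, unmodified.
        self.first_subset = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1, bias=False),    # 352 -> 176
            depthwise_separable(32, 64),
            depthwise_separable(64, 128, stride=2),   # 176 -> 88
            depthwise_separable(128, 128),
            depthwise_separable(128, 256, stride=2),  # 88 -> 44
        )
        # Compressed network (430): remaining layers reduced to
        # point-wise convolutions (abridged here for brevity).
        self.compressed = nn.Sequential(
            pointwise(256, 512, stride=2),            # 44 -> 22
            pointwise(512, 512),
            pointwise(512, 1024, stride=2),           # 22 -> 11
            nn.Conv2d(1024, 35, 1),                   # 7 anchors x 5 values
        )

    def forward(self, x):
        return self.compressed(self.first_subset(x))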

FIG. 5 depicts an example of a neural network 500 for detecting target features of an image, according to an embodiment. The neural network 500 generally corresponds to the first subset network 420 of the neural network model 400 and illustrates the first four layers of the MobileNet neural network. Although the neural network 500 includes four layers of the MobileNet neural network, in other embodiments, the neural network 500 includes a larger subset of the MobileNet neural network, for example, five, six, or more layers.

FIG. 6 depicts an example of a neural network 600 for a feature detection engine, according to an embodiment. The neural network 600 generally corresponds to the compressed network 430 of the neural network model 400. In the illustrated embodiment, the neural network 600 includes layers five through thirteen, compressed to point-wise convolutions.

FIG. 7 depicts an example of a post-processor 700 for a feature detection engine, according to an embodiment. The post-processor 700 generally corresponds to the post-processor 240, in some embodiments. The post-processor 700 receives tile data 710 from a neural network model and converts the tile data 710 to estimated feature locations 760. In an embodiment, for example, the post-processor 700 receives the tile data 450 from the neural network model 400 or the tile data 650 from the neural network 600. The tile data 710 may be a binary output having dimensions of 11×11×35, corresponding to a set of feature anchors for each tile. These dimensions correspond to the example embodiment where the input image is divided into 11×11 tiles, with each tile having seven feature anchors (×35, for seven feature anchors each having five values). In other embodiments having different numbers of tiles and feature anchors, the dimensions of the tile data 710 are adjusted accordingly; for example, the tile data 710 may have dimensions of 8×8×10 for 8×8 tiles with two feature anchors.

The post-processor 700 includes a feature anchor separator 720, a rescaler 730, a parser 740, and a resizer 750. The feature anchor separator 720 splits the binary format of the tile data 710 into separate feature anchors having an x coordinate, y coordinate, width, height, and confidence level. After splitting, the rescaler 730 rescales the x, y coordinates and confidence level of the feature anchors using a sigmoid function. The parser 740 parses the bounding boxes defined by the x, y coordinates, height, and width and removes bounding boxes with confidence levels that do not meet a minimum confidence threshold (e.g., discards those having less than 80% confidence). For those bounding boxes that meet the minimum confidence threshold, the resizer 750 resizes the bounding box relative to the size of the original image. As an example, when the size of the original image is 1920×1080, which has been resized down to 352×352 by the pre-processor 300, the resizer 750 resizes a bounding box of 12×14 pixels to a bounding box of 65×43 pixels and provides the estimated feature locations as a list of arrays.
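
For illustration, the post-processing chain may be sketched as follows. This is a minimal Python sketch; in particular, it assumes the rescaled x, y coordinates are fractions of the network input, which the disclosure does not specify.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def postprocess(tile_data, orig_w, orig_h, net_size=352.0, min_conf=0.8):
    # tile_data: raw 11 x 11 x 35 network output. Split it into
    # individual anchors of (x, y, width, height, confidence).
    anchors = tile_data.reshape(-1, 5)
    results = []
    for x, y, w, h, conf in anchors:
        conf = sigmoid(conf)              # rescale the confidence level
        if conf < min_conf:
            continue                      # e.g., discard below 80%
        x, y = sigmoid(x), sigmoid(y)     # rescale the x, y coordinates
        # Resize the bounding box from network-input space back to the
        # original image; e.g., 12 x 14 at 352 x 352 becomes roughly
        # 65 x 43 at 1920 x 1080.
        sx, sy = orig_w / net_size, orig_h / net_size
        results.append([x * orig_w, y * orig_h, w * sx, h * sy, conf])
    return results                        # a list of arrays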

FIG. 8 depicts details of a method 800 for training a neural network for detecting target features in images, according to an embodiment. A general order for the steps of the method 800 is shown in FIG. 8. The method 800 may include more or fewer steps or may arrange the order of the steps differently than those shown in FIG. 8. The method 800 can be executed as a set of computer-executable instructions executed by a computer system and encoded or stored on a computer readable medium. In examples, aspects of the method 800 are performed by one or more processing devices, such as a computer or server. Further, the method 800 can be performed by gates or circuits associated with a processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array (FPGA), a system on chip (SOC), a neural processing unit, or other hardware device. Hereinafter, the method 800 shall be explained with reference to the systems, components, modules, software, data structures, user interfaces, etc. described in conjunction with FIGS. 1-7.

The method starts at step 810, where the neural network is trained using a first data set that includes labeled images. In various embodiments, the neural network corresponds to the neural network models 162, 230, and/or 400. At least some of the labeled images have subjects with labeled features. In some embodiments, the first data set corresponds to a 2017 COCO dataset for face and body.

Training the neural network also includes dividing (step 812) each of the labeled images of the first data set into a respective plurality of tiles and generating (step 814), for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile. In some embodiments, each of the plurality of feature anchors indicates a bounding box within the corresponding tile that contains a target feature. In some examples, the bounding box corresponds to the bounding box 250. In some embodiments, the target features include a subject face and/or subject body.

In some embodiments, training the neural network using the first data set of labeled images includes normalizing the RGB values of the labeled images from a range of 0 to 255 to a range of −1 to 1. In an embodiment, for example, the color representation normalizer 340 normalizes the RGB values.

At step 820, target features that correspond to the plurality of feature anchors are detected in a second data set of unlabeled images. In some embodiments, the second data set is generated to include images from videos without people. In some scenarios, by using images without people (and thus without target features), the second data set is configured to evoke false positive detections of target features. In an embodiment, the second data set is generated to include images from videos that depict subjects with different head poses. For example, the videos depict different subjects rotating their face in different directions in front of a camera. In some scenarios, these images provide improved detection accuracy for images where a subject is not looking directly into the camera.

At step 830, images of the second data set having target features that were not detected are labeled. For example, images of the second data set having a face or body are labeled with bounding boxes that surround the target feature. In some embodiments, labeling the target features that were not detected is performed manually. In other embodiments, a different neural network is used to label the target features that were not detected.

At step 840, a third data set that includes the first data set and the labeled images of the second data set is generated. In some embodiments, the third data set is generated to include images of the second data set that correspond to false positive detections of the target features. In other words, after inserting images into the second data set that evoke false positives, those false positives are then used to retrain and improve the accuracy of the neural network.

In some embodiments, generating the third data set includes performing a randomized crop of different aspect ratios on at least some of the third data set. As discussed above, the original image is resized to 352×352 as an input image to be provided to the neural network model, so the original image may be stretched horizontally or vertically based on its aspect ratio. By introducing a randomized crop of different aspect ratios on some images, the neural network model is made more robust against a range of input image aspect ratios. The different aspect ratios may include, for example, 16:9, 4:3, 1:1, 3:4, 9:16, or other suitable aspect ratios.
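
A randomized aspect-ratio crop of this kind might be sketched as follows (illustrative Python; the crop placement policy is an assumption, not part of the disclosure):

import random

ASPECT_RATIOS = [(16, 9), (4, 3), (1, 1), (3, 4), (9, 16)]

def random_aspect_crop(img):
    # img: H x W x 3 array. Choose an aspect ratio at random, then take
    # the largest crop with that ratio that fits, at a random position.
    h, w = img.shape[:2]
    ar_w, ar_h = random.choice(ASPECT_RATIOS)
    scale = min(w / ar_w, h / ar_h)
    cw, ch = int(ar_w * scale), int(ar_h * scale)
    x0 = random.randint(0, w - cw)
    y0 = random.randint(0, h - ch)
    return img[y0:y0 + ch, x0:x0 + cw]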

In some embodiments, generating the third data set includes generating at least some images having light levels below a low light threshold, which further includes augmenting an image of the third data set to have light levels below the low light threshold. In other words, some images are augmented to have lower light levels than their original light levels, which allows the neural network model to be trained to have improved detection of target features in low light conditions.
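
As a simple illustration, such a low-light augmentation might scale pixel intensities down (the scale factor here is an assumption; the disclosure does not define the low light threshold):

import numpy as np

def darken(img, factor=0.2):
    # Scale pixel intensities down to simulate low-light conditions.
    return (img.astype(np.float32) * factor).astype(img.dtype)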

In some embodiments, generating the third data set includes generating at least some images having partially occluded target features, which further includes augmenting an image of the third data set to have a partially occluded target feature. In other words, an image of a complete face may be augmented to “hide” at least part of the face, which allows the neural network model to be trained to have improved detection of target features that are hidden by obstructions (e.g., a face mask, a hand over a lens). In one such embodiment, augmenting the image includes cropping the image or inserting a block into the image to obtain the partially occluded target feature.
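
Inserting a block to partially occlude a target feature might be sketched as follows (illustrative Python; the block size and fill value are assumptions):

import random

def occlude(img, frac=0.25):
    # Zero out a randomly placed rectangle to "hide" part of a feature.
    h, w = img.shape[:2]
    bh, bw = int(h * frac), int(w * frac)
    y0 = random.randint(0, h - bh)
    x0 = random.randint(0, w - bw)
    out = img.copy()
    out[y0:y0 + bh, x0:x0 + bw] = 0
    return out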

At step 850, the neural network is trained using the third data set.

In some embodiments, training the neural network using the third data set includes training the neural network using floating point values for weights of the neural network. In an embodiment, the method further includes quantizing the weights of the neural network using integers. In some scenarios, this approach reduces processing times for detection of the target features on processors that do not have a neural processing unit, allowing the neural network to be used on a wider range of computing devices.
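
A minimal sketch of this quantization step, assuming symmetric linear int8 quantization (the disclosure does not specify the scheme):

import numpy as np

def quantize_int8(weights):
    # Symmetric linear quantization of floating point weights to int8.
    scale = max(np.abs(weights).max() / 127.0, 1e-12)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale  # dequantize with q * scale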

In some embodiments, the steps 820, 830, 840, and 850 are repeated one or more times to further improve the detection accuracy of the neural network models 162, 230, and/or 400.

FIG. 9 is a block diagram illustrating physical components (e.g., hardware) of a computing device 900 with which aspects of the disclosure may be practiced. The computing device components described below may have computer executable instructions for implementing a feature detection application 920 on a computing device (e.g., computing device 110, server 120), including computer executable instructions for feature detection application 920 that can be executed to implement the methods disclosed herein. In a basic configuration, the computing device 900 may include at least one processing unit 902 and a system memory 904. Depending on the configuration and type of computing device, the system memory 904 may comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories. The system memory 904 may include an operating system 905 and one or more program modules 906 suitable for running feature detection application 920, such as one or more components with regard to FIGS. 1-2 and, in particular, feature detection engine 112 and/or 122.

The operating system 905, for example, may be suitable for controlling the operation of the computing device 900. Furthermore, embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program, and are not limited to any particular application or system. This basic configuration is illustrated in FIG. 9 by those components within a dashed line 908. The computing device 900 may have additional features or functionality. For example, the computing device 900 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 9 by a removable storage device 909 and a non-removable storage device 910.

As stated above, a number of program modules and data files may be stored in the system memory 904. While executing on the processing unit 902, the program modules 906 (e.g., feature detection application 920) may perform processes including, but not limited to, the aspects as described herein. Other program modules that may be used in accordance with aspects of the present disclosure, and in particular for feature detection application 920, may include a feature detection engine 921, etc.

Furthermore, embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in FIG. 9 may be integrated onto a single integrated circuit. Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit. When operating via an SOC, the functionality described herein with respect to the capability of a client to switch protocols may be operated via application-specific logic integrated with other components of the computing device 900 on the single integrated circuit (chip). Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems.

The computing device 900 may also have one or more input device(s) 912 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc. The output device(s) 914 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used. The computing device 900 may include one or more communication connections 916 allowing communications with other computing devices 950. Examples of suitable communication connections 916 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.

The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The system memory 904, the removable storage device 909, and the non-removable storage device 910 are all computer storage media examples (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 900. Any such computer storage media may be part of the computing device 900. Computer storage media does not include a carrier wave or other propagated or modulated data signal.

Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.

FIGS. 10 and 11 illustrate a mobile computing device 1000, for example, a mobile telephone, a smart phone, wearable computer (such as a smart watch), a tablet computer, a laptop computer, and the like, with which embodiments of the disclosure may be practiced. In some aspects, the client may be a mobile computing device. With reference to FIG. 10, one aspect of a mobile computing device 1000 for implementing the aspects is illustrated. In a basic configuration, the mobile computing device 1000 is a handheld computer having both input elements and output elements. The mobile computing device 1000 typically includes a display 1005 and one or more input buttons 1010 that allow the user to enter information into the mobile computing device 1000. The display 1005 of the mobile computing device 1000 may also function as an input device (e.g., a touch screen display). If included, an optional side input element 1015 allows further user input. The side input element 1015 may be a rotary switch, a button, or any other type of manual input element. In alternative aspects, mobile computing device 1000 may incorporate more or fewer input elements. For example, the display 1005 may not be a touch screen in some embodiments. In yet another alternative embodiment, the mobile computing device 1000 is a portable phone system, such as a cellular phone. The mobile computing device 1000 may also include an optional keypad 1035. Optional keypad 1035 may be a physical keypad or a “soft” keypad generated on the touch screen display. In various embodiments, the output elements include the display 1005 for showing a graphical user interface (GUI), a visual indicator 1020 (e.g., a light emitting diode), and/or an audio transducer 1025 (e.g., a speaker). In some aspects, the mobile computing device 1000 incorporates a vibration transducer for providing the user with tactile feedback. In yet another aspect, the mobile computing device 1000 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., a High-Definition Multimedia Interface port) for sending signals to or receiving signals from an external device.

FIG. 11 is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 1000 can incorporate a system (e.g., an architecture) 1002 to implement some aspects. In one embodiment, the system 1002 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players). In some aspects, the system 1002 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.

One or more application programs 1066 may be loaded into the memory 1062 and run on or in association with the operating system 1064. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth. The system 1002 also includes a non-volatile storage area 1068 within the memory 1062. The non-volatile storage area 1068 may be used to store persistent information that should not be lost if the system 1002 is powered down. The application programs 1066 may use and store information in the non-volatile storage area 1068, such as email or other messages used by an email application, and the like. A synchronization application (not shown) also resides on the system 1002 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 1068 synchronized with corresponding information stored at the host computer. As should be appreciated, other applications may be loaded into the memory 1062 and run on the mobile computing device 1000, including the instructions for detecting target features as described herein (e.g., feature detection engine 112 and/or 122).

The system 1002 has a power supply 1070, which may be implemented as one or more batteries. The power supply 1070 may further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.

The system 1002 may also include a radio interface layer 1072 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 1072 facilitates wireless connectivity between the system 1002 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 1072 are conducted under control of the operating system 1064. In other words, communications received by the radio interface layer 1072 may be disseminated to the application programs 1066 via the operating system 1064, and vice versa.

The visual indicator 1020 may be used to provide visual notifications, and/or an audio interface 1074 may be used for producing audible notifications via an audio transducer 1025 (e.g., the audio transducer 1025 illustrated in FIG. 10). In the illustrated embodiment, the visual indicator 1020 is a light emitting diode (LED) and the audio transducer 1025 may be a speaker. These devices may be directly coupled to the power supply 1070 so that when activated, they remain on for a duration dictated by the notification mechanism even though the processor 1060 and other components might shut down for conserving battery power. The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device. The audio interface 1074 is used to provide audible signals to and receive audible signals from the user. For example, in addition to being coupled to the audio transducer 1025, the audio interface 1074 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation. In accordance with embodiments of the present disclosure, the microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below. The system 1002 may further include a video interface 1076 that enables an operation of a peripheral device 1030 (e.g., an on-board camera) to record still images, video stream, and the like.

A mobile computing device 1000 implementing the system 1002 may have additional features or functionality. For example, the mobile computing device 1000 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 11 by the non-volatile storage area 1068.

Data/information generated or captured by the mobile computing device 1000 and stored via the system 1002 may be stored locally on the mobile computing device 1000, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 1072 or via a wired connection between the mobile computing device 1000 and a separate computing device associated with the mobile computing device 1000, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated, such data/information may be accessed via the mobile computing device 1000 via the radio interface layer 1072 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.

As should be appreciated, FIGS. 10 and 11 are described for purposes of illustrating the present methods and systems and are not intended to limit the disclosure to a particular sequence of steps or a particular combination of hardware or software components.


The phrases “at least one,” “one or more,” “or,” and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C,” “at least one of A, B, or C,” “one or more of A, B, and C,” “one or more of A, B, or C,” “A, B, and/or C,” and “A, B, or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.

The term “a” or “an” entity refers to one or more of that entity. As such, the terms “a” (or “an”), “one or more,” and “at least one” can be used interchangeably herein. It is also to be noted that the terms “comprising,” “including,” and “having” can be used interchangeably.

The term “automatic” and variations thereof, as used herein, refers to any process or operation, which is typically continuous or semi-continuous, done without material human input when the process or operation is performed. However, a process or operation can be automatic, even though performance of the process or operation uses material or immaterial human input, if the input is received before performance of the process or operation. Human input is deemed to be material if such input influences how the process or operation will be performed. Human input that consents to the performance of the process or operation is not deemed to be “material.”

Any of the steps, functions, and operations discussed herein can be performed continuously and automatically.

The exemplary systems and methods of this disclosure have been described in relation to computing devices. However, to avoid unnecessarily obscuring the present disclosure, the preceding description omits a number of known structures and devices. This omission is not to be construed as a limitation. Specific details are set forth to provide an understanding of the present disclosure. It should, however, be appreciated that the present disclosure may be practiced in a variety of ways beyond the specific details set forth herein.

Furthermore, while the exemplary aspects illustrated herein show the various components of the system collocated, certain components of the system can be located remotely, at distant portions of a distributed network, such as a LAN and/or the Internet, or within a dedicated system. Thus, it should be appreciated that the components of the system can be combined into one or more devices, such as a server or communication device, or collocated on a particular node of a distributed network, such as an analog and/or digital telecommunications network, a packet-switched network, or a circuit-switched network. It will be appreciated from the preceding description, and for reasons of computational efficiency, that the components of the system can be arranged at any location within a distributed network of components without affecting the operation of the system.

Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements. These wired or wireless links can also be secure links and may be capable of communicating encrypted information. Transmission media used as links, for example, can be any suitable carrier for electrical signals, including coaxial cables, copper wire, and fiber optics, and may take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

While the flowcharts have been discussed and illustrated in relation to a particular sequence of events, it should be appreciated that changes, additions, and omissions to this sequence can occur without materially affecting the operation of the disclosed configurations and aspects.

A number of variations and modifications of the disclosure can be used. It would be possible to provide for some features of the disclosure without providing others.

In yet another configuration, the systems and methods of this disclosure can be implemented in conjunction with a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit element(s), an ASIC or other integrated circuit, a digital signal processor, a hard-wired electronic or logic circuit such as a discrete element circuit, a programmable logic device or gate array such as a PLD, PLA, FPGA, or PAL, a special purpose computer, any comparable means, or the like. In general, any device(s) or means capable of implementing the methodology illustrated herein can be used to implement the various aspects of this disclosure. Exemplary hardware that can be used for the present disclosure includes computers, handheld devices, telephones (e.g., cellular, Internet enabled, digital, analog, hybrids, and others), and other hardware known in the art. Some of these devices include processors (e.g., a single or multiple microprocessors), memory, nonvolatile storage, input devices, and output devices. Furthermore, alternative software implementations including, but not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing can also be constructed to implement the methods described herein.

In yet another configuration, the disclosed methods may be readily implemented in conjunction with software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware using standard logic circuits or VLSI design. Whether software or hardware is used to implement the systems in accordance with this disclosure is dependent on the speed and/or efficiency requirements of the system, the particular function, and the particular software or hardware systems or microprocessor or microcomputer systems being utilized.

In yet another configuration, the disclosed methods may be partially implemented in software that can be stored on a storage medium and executed on a programmed general-purpose computer with the cooperation of a controller and memory, a special purpose computer, a microprocessor, or the like. In these instances, the systems and methods of this disclosure can be implemented as a program embedded on a personal computer such as an applet, JAVA® or CGI script, as a resource residing on a server or computer workstation, as a routine embedded in a dedicated measurement system, system component, or the like. The system can also be implemented by physically incorporating the system and/or method into a software and/or hardware system.

The disclosure is not limited to the standards and protocols described herein. Other similar standards and protocols not mentioned herein are in existence and are considered to be included in the present disclosure. Moreover, the standards and protocols mentioned herein and other similar standards and protocols not mentioned herein are periodically superseded by faster or more effective equivalents having essentially the same functions. Such replacement standards and protocols having the same functions are considered equivalents included in the present disclosure.

The description and illustration of one or more aspects provided in this application are not intended to limit or restrict the scope of the disclosure as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode of the claimed disclosure. The claimed disclosure should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an embodiment with a particular set of features. Having been provided with the description and illustration of the present application, one skilled in the art may envision variations, modifications, and alternate aspects falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope of the claimed disclosure.

What is claimed is:
1. A computer-implemented method of training a neural network for detecting target features in images, the method comprising: training the neural network using a first data set that includes labeled images, at least some of the labeled images having subjects with labeled features, including dividing each of the labeled images of the first data set into a respective plurality of tiles, and generating, for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile; detecting target features that correspond to the plurality of feature anchors in a second data set of unlabeled images; labeling images of the second data set having target features that were not detected; generating a third data set that includes the first data set and the labeled images of the second data set; and training the neural network using the third data set.

2. The computer-implemented method of claim 1, wherein each of the plurality of feature anchors indicates a bounding box within the corresponding tile that contains a target feature.

3. The computer-implemented method of claim 1, further comprising generating the third data set to include images of the second data set that correspond to false positive detections of the target features.

4. The computer-implemented method of claim 3, further comprising generating the second data set to include images from videos without people.

5. The computer-implemented method of claim 1, further comprising generating the second data set to include images from videos that depict subjects with different head poses.

6. The computer-implemented method of claim 1, wherein generating the third data set comprises performing a randomized crop of different aspect ratios on at least some of the third data set.

7. The computer-implemented method of claim 1, wherein generating the third data set comprises generating at least some images having light levels below a low light threshold, including augmenting an image of the third data set to have light levels below the low light threshold.

8. The computer-implemented method of claim 1, wherein generating the third data set comprises generating at least some images having partially occluded target features, including augmenting an image of the third data set to have a partially occluded target feature.

9. The computer-implemented method of claim 8, wherein augmenting the image comprises cropping the image or inserting a block to obtain the partially occluded target feature.

10. The computer-implemented method of claim 1, wherein training the neural network using the third data set comprises training the neural network using floating point values for weights of the neural network, and wherein the method further comprises quantizing the weights of the neural network using integers.

11. The computer-implemented method of claim 1, wherein training the neural network using the first data set of labeled images comprises normalizing RGB values of the labeled images from 0 to 255 to −1 to 1.

12. The computer-implemented method of claim 1, wherein the target features include a subject face and/or subject body.

13. A system for training a neural network for detecting target features in images, the system comprising: a processor, and a memory storing computer-executable instructions that when executed by the processor cause the system to: train the neural network using a first data set that includes labeled images, at least some of the labeled images having subjects with labeled features, including dividing each of the labeled images into a respective plurality of tiles, and generating, for each of the plurality of tiles, a plurality of feature anchors that indicate target features within the corresponding tile; detect target features that correspond to the plurality of feature anchors in a second data set of unlabeled images; label images of the second data set having target features that were not detected; generate a third data set that includes the first data set of labeled images and the labeled images of the second data set; and train the neural network using the third data set.

14. The system of claim 13, wherein each of the plurality of feature anchors indicates a bounding box within the corresponding tile that contains a target feature.

15. The system of claim 13, further comprising generating the third data set to include images of the second data set that correspond to false positive detections of the target features.

16. The system of claim 15, further comprising generating the second data set to include images from videos without people.

17. The system of claim 13, further comprising generating the second data set to include images from videos that depict subjects with different head poses.

18. The system of claim 13, wherein generating the third data set comprises performing a randomized crop of different aspect ratios on at least some of the third data set.

19. An image processing system that includes a neural network implemented on a computer for feature detection, comprising: a convolutional neural network having a plurality of layers stacked sequentially, including: a first set of layers, each layer of the first set of layers having a depth-wise convolution and a point-wise convolution, wherein the first set of layers is a first subset of a different neural network; and a second set of layers after the first set of layers, each layer of the second set of layers having a point-wise convolution.

20. The image processing system of claim 19, wherein the second set of layers is based upon a compression of a remainder subset of the different neural network.